Abstract

Suppose a finite set X is repeatedly transformed by a sequence of permutations of
a certain type acting on an initial element x to produce a final state y. We
investigate how 'different' the resulting state y' can be from y if a slight change
is made to the sequence, either by deleting one permutation or by replacing it with
another. Here the 'difference' between y and y' might be measured by the minimum
number of permutations of the permitted type required to transform y to y', or by
some other metric. We discuss this first in the general setting of sensitivity to
perturbation of walks in Cayley graphs of groups with a specified set of generators.
We then investigate some permutation groups and generators arising in computational
genomics, and the statistical implications of the findings.

Keywords: evolutionary distance, permutation, metric, group action, genome
rearrangements

1. Introduction 

In evolutionary genomics, two genomes^1 are frequently compared by the minimum
number of 'rearrangements' (of various types) required to transform one genome
into another [7]. This minimum number is then used to estimate the actual number
of events and thereby the 'evolutionary distance' between the species involved.
Since both the precise number and the actual rearrangement events that occurred in
the evolution of the two genomes from a common ancestor are unknown, it is
pertinent to have some idea of how sensitive this distance estimate might be to the
sequence of events (not just the number) that really took place.

^1 For the purposes of this paper a genome is simply an ordered sequence of objects -
usually taken from the DNA alphabet or a collection of genes - which may occur with
or without repetition, and with or without an orientation (+,-).

This question has important implications for the accurate inference of evolutionary
relationships between species from their genomes, and we discuss some of these
further in Section 5. However, we begin by framing the type of mathematical
questions that we will be considering in a general algebraic context.

Let $G$ be a finite group, whose identity element we write as $1_G$, and let $S$ be
a subset of generators that is symmetric (i.e. closed under inverses, so
$x \in S \Rightarrow x^{-1} \in S$). In addition, let $\Gamma = Cay(G,S)$ be the
associated Cayley graph, with vertex set $G$ and an edge connecting $g$ and $g'$ if
there exists $s \in S$ with $g' = gs$ (unless otherwise stated, we use the convention
of multiplying group elements from left to right). For any two elements
$g, g' \in G$, the distance $d_S(g,g')$ in $Cay(G,S)$ is the minimum value of $k$ for
which there exist elements $s_1, \ldots, s_k$ of $S$ so that $g' = g s_1 \cdots s_k$
(for $g = g'$, we set $d_S(g,g') = 0$). Note that $d_S$ is a metric; in particular,
$d_S(g,g') = d_S(g',g)$, since $S$ is symmetric.

In this paper, our focus is on the following two quantities:

\[ \lambda_1(G,S) := \max_{g \in G,\, s \in S} \{ d_S(sg, g) \}, \]

and

\[ \lambda_2(G,S) := \max_{g \in G,\, s, s' \in S} \{ d_S(sg, s'g) \}. \]

One way to view these quantities is via the following result which is easily 
proved. 

Lemma 1. Let $S$ be a symmetric set of generators for a finite group $G$. Then:

• $\lambda_1(G,S)$ is the maximum value of $d_S(g,g')$ between any pair of elements
$g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$,
where $s'_i = s_i \in S$ for all but at most one value (say $j$) for $i$, and
$s'_j = 1_G$.

• $\lambda_2(G,S)$ is the maximum value of $d_S(g,g')$ between any pair of elements
$g$ and $g'$ of $G$ for which $g = s_1 s_2 \cdots s_k$ and $g' = s'_1 s'_2 \cdots s'_k$,
where $s'_i = s_i \in S$ for all but at most one value (say $j$) for $i$, and
$s'_j \in S$, $s'_j \neq s_j$.

Thus, $\lambda_1(G,S)$ tells us how much (under $d_S$) a product of generators can
change if we drop one value of $s$, whilst $\lambda_2(G,S)$ tells us how much (again
under $d_S$) a product of generators in $S$ can change if we substitute one value of
$s$ by another $s'$ (see Fig. 1 for an example where $\lambda_2(G,S) = 6$).
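
For small groups both quantities can be computed by brute force. The following Python sketch is an illustration only, not part of the original analysis; the function names `word_lengths`, `sensitivities` and `adjacent_transpositions` are hypothetical, and permutations are stored 0-based as tuples. It finds the word length of every element by breadth-first search from the identity, and then maximises $d_S(sg,g)$ and $d_S(sg,s'g)$ over all choices, using the fact that $d_S(g,g')$ is the word length of $g^{-1}g'$.

```python
from collections import deque


def compose(p, q):
    """Return the product pq, read as the map i -> p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)


def word_lengths(n, gens):
    """Breadth-first search from the identity: word length of every element."""
    identity = tuple(range(n))
    length = {identity: 0}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for s in gens:
            h = compose(g, s)
            if h not in length:
                length[h] = length[g] + 1
                queue.append(h)
    return length


def sensitivities(n, gens):
    """Return (lambda_1, lambda_2, diameter) for the group generated by gens."""
    length = word_lengths(n, gens)
    lam1 = lam2 = 0
    for g in length:
        for s in gens:
            sg_inv = inverse(compose(s, g))
            # d_S(sg, g) is the word length of (sg)^{-1} g
            lam1 = max(lam1, length[compose(sg_inv, g)])
            for t in gens:
                # d_S(sg, tg) is the word length of (sg)^{-1} (tg)
                lam2 = max(lam2, length[compose(sg_inv, compose(t, g))])
    return lam1, lam2, max(length.values())


def adjacent_transpositions(n):
    gens = []
    for i in range(n - 1):
        p = list(range(n))
        p[i], p[i + 1] = p[i + 1], p[i]
        gens.append(tuple(p))
    return gens


print(sensitivities(4, adjacent_transpositions(4)))  # (5, 6, 6)
```

For the permutation group on four elements with the three adjacent transpositions this gives $\lambda_1 = 5$ and $\lambda_2 = 6$, the latter being equal to the diameter of the Cayley graph.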

As such, $\lambda_m$ is a measure of the 'sensitivity' of walks in the Cayley graph
to a switch in or deletion of a generator at some point. Moreover, if $G$ acts
transitively and freely^2 on a set $X$ then $\lambda_m$ provides a corresponding
measure of sensitivity of this action to a switch in or deletion of a generator
(since a transitive, free action of $G$ on $X$ is isomorphic to the action of $G$ on
itself by right multiplication). Actions with large $\lambda_m$ values can thus be
viewed as exhibiting a discrete, group-theoretic analogue of the 'butterfly effect'
in non-linear dynamics (see e.g. [9]).

Figure 1: The Cayley graph $Cay(G,S)$ for $G = \Sigma_4$ (the permutation group on
{1, 2, 3, 4}) and the set of transpositions $S = \{(12), (23), (34)\}$. Substituting
just one element - namely (34) for (12) - in the product corresponding to the walk in
the lower front face (which starts and returns to the lower-most point [1234])
results in a walk that ends at a point ([4321], top) that is very distant (under
$d_S$) from the end-point of the original walk. In fact, the two end-points are at
maximal distance in this example.

In the genomics applications that we shall consider, elements of the group $G$
correspond to genomes, and $d_S$ to the evolutionary distance between them. After
presenting some general results concerning $\lambda_m$ in the next section, in
Sections 3 and 4 we discuss some applications arising for various choices of $G$ and
$S$. These include the Klein four-group, which arises in evolutionary models of DNA
sequence evolution, and the permutation group, which typically appears when studying
rearrangement distances between genomes. We conclude in Section 5 with some
statistical implications of our results.

One can imagine many other settings besides genomics where similar questions
arise - for example, in a sequence of moves that should unscramble the Rubik's
cube from a given position, what will be the consequences (in terms of the
number of moves required) for completing the unscrambling if a mistake is made
at some point (or one move is forgotten)? In addition, related questions arise in
the study of 'automatic' groups, where the group under consideration is typically
infinite.

^2 $G$ acts transitively on $X$ if for any pair $x, y \in X$ there exists $g \in G$
with $g \circ x = y$; the action is free if $g \circ x = h \circ x \Rightarrow g = h$,
for all $g, h \in G$ and $x \in X$, where '$\circ$' denotes the action of $G$ on $X$.

2. General inequalities 

We first make some basic observations about Cayley graphs and the metric
$d_S$ (further background on basic group theory, Cayley graphs, and group actions
can be found in [15]). It is well known that $\Gamma$ is a connected regular graph of
degree equal to the cardinality of $S$ and that $\Gamma$ is also vertex-transitive
(see, for example, [11], Proposition 1). Consider the function
$l_S : G \to \{0, 1, 2, 3, \ldots, |G|\}$, where $l_S(1_G) = 0$ and, for each
$g \in G - \{1_G\}$, $l_S(g)$ is the smallest number $l$ of elements
$s_1, \ldots, s_l$ from $S$ for which we can write $g = s_1 \cdots s_l$. The function
$l_S$ clearly satisfies the subadditivity property that, for all $g, g' \in G$:

\[ l_S(gg') \leq l_S(g) + l_S(g'). \]

In addition,

\[ l_S(g^{-1}) = l_S(g), \]

and

\[ l_S(g) = 1 \Leftrightarrow g \in S, \qquad l_S(g) = 0 \Leftrightarrow g = 1_G. \]

Note that $l_S(gg')$ is generally not equal to $l_S(g'g)$. The metric $d_S$, described
in the previous section, is related to $l_S$ as follows:

\[ d_S(g,g') = l_S(g^{-1}g'). \]

Consequently, by definition:

\[ \lambda_1(G,S) = \max_{g \in G,\, s \in S} \{ l_S(g^{-1} s g) \}, \tag{1} \]

and

\[ \lambda_2(G,S) = \max_{g \in G,\, s, s' \in S} \{ l_S(g^{-1} s s' g) \}. \tag{2} \]

Let $l_S(G) = \max\{ l_S(g) : g \in G \}$, which is the diameter of $Cay(G,S)$, that
is, the maximum length of a shortest path connecting any two elements of $G$. Clearly,
$\lambda_1(G,S), \lambda_2(G,S) \leq l_S(G)$. Moreover:

\[ \lambda_2(G,S) \leq 2 \cdot \lambda_1(G,S), \tag{3} \]

since, for any $g \in G$ and $s, s' \in S$, we have:

\[ d_S(sg, s'g) \leq d_S(sg, g) + d_S(g, s'g). \]

A partial converse to Inequality (3) is provided by the following:

\[ \lambda_1(G,S) \leq \lambda_2(G,S) + \lambda'_1(G,S), \tag{4} \]

where $\lambda'_1(G,S) = \max_{g \in G} \min_{s \in S} \{ l_S(g^{-1} s g) \}$. To verify
(4), select a pair $g \in G$, $s \in S$ so that $l_S(g^{-1} s g) = \lambda_1(G,S)$. Then:

\[ \lambda_1(G,S) = d_S(sg, g) \leq d_S(sg, s_1 g) + d_S(s_1 g, g), \]

where $s_1$ is an element $s'$ (possibly equal to $s$) in $S$ that minimizes
$l_S(g^{-1} s' g)$. Now, $d_S(sg, s_1 g) \leq \lambda_2(G,S)$ (even if $s' = s$) and
$d_S(s_1 g, g) \leq \lambda'_1(G,S)$, and so we obtain (4).

Note also that if $G$ is Abelian, then $\lambda_1(G,S) = 1$ and
$\lambda_2(G,S) \leq 2$ for any symmetric set $S$ of generators. Moreover, for the
Abelian 2-group $G = \mathbb{Z}_2^n$ with the symmetric set $S$ of generators
consisting of all $n$ elements with the identity at all but one position, we have
$l_S(G) = n$ and $\lambda_1(G,S) = 1$. This shows that the gap in the inequality
$\lambda_1(G,S) \leq l_S(G)$ can be arbitrarily large. Our next result generalizes
this observation further.
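
The Abelian 2-group example is easy to confirm numerically. In the sketch below (an illustration under stated assumptions; the helper names are hypothetical) group elements are 0/1 tuples combined by componentwise XOR, the generators are the $n$ unit vectors, and the word length of an element is therefore its Hamming weight.

```python
from itertools import product


def xor(u, v):
    """Group operation of Z_2^n: componentwise addition modulo 2."""
    return tuple(a ^ b for a, b in zip(u, v))


def unit_vectors(n):
    return [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]


def word_length(g):
    # With unit-vector generators, the word length is the Hamming weight.
    return sum(g)


def abelian_sensitivities(n):
    """Return (lambda_1, lambda_2, diameter) for Z_2^n with unit-vector generators."""
    gens = unit_vectors(n)
    group = list(product((0, 1), repeat=n))
    # Every element is its own inverse, so g^{-1} s g is computed as g + s + g.
    lam1 = max(word_length(xor(xor(g, s), g)) for g in group for s in gens)
    lam2 = max(
        word_length(xor(xor(xor(g, s), t), g))
        for g in group for s in gens for t in gens
    )
    diameter = max(word_length(g) for g in group)
    return lam1, lam2, diameter


for n in range(2, 7):
    print(n, abelian_sensitivities(n))
```

For each $n \geq 2$ the first two values stay at 1 and 2 while the diameter grows as $n$.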

Lemma 2. Let $G_1, G_2, \ldots, G_k$ be finite groups, and let $S_i$ be a symmetric set
of generators of $G_i$ for $i = 1, \ldots, k$. Consider the direct product
$G = G_1 \times G_2 \times \cdots \times G_k$ along with the symmetric set of generators
$S$ of $G$ consisting of all possible $k$-tuples which consist of the identity element
of $G_i$ at all but one co-ordinate $i$, where it takes some value in $S_i$. Then
(i) $\lambda_1(G,S) \leq \max_{1 \leq j \leq k} \{ l_{S_j}(G_j) \}$, and
(ii) $l_S(G) = \sum_{i=1}^{k} l_{S_i}(G_i)$.

Proof: For Part (i), let $\lambda_1(G,S) = l_S(g^{-1} s g)$, where $s \in S$ is a
non-identity element at some co-ordinate $u$. Notice that $(g^{-1} s g)_j = 1_{G_j}$
for all $j \neq u$. Moreover, $(g^{-1} s g)_u = s_1 \cdots s_l$ where
$l \leq l_{S_u}(G_u)$. Thus $l_S(g^{-1} s g) \leq l_{S_u}(G_u)$, as claimed.

For Part (ii), the inequality $l_S(G) \leq \sum_{i=1}^{k} l_{S_i}(G_i)$ is clear; to
establish the reverse inequality, let $g_i$ be an element of $G_i$ with
$l_{S_i}(g_i) = l_{S_i}(G_i)$, and $g = (g_1, \ldots, g_k) \in G$. Then
$l_S(g) = \sum_{i=1}^{k} l_{S_i}(G_i)$, and so $l_S(G) \geq \sum_{i=1}^{k} l_{S_i}(G_i)$. □

We now consider how $\lambda_m$ behaves under group homomorphisms. Suppose $H$
is the homomorphic image of a group $G$ under a map $p$. Let $N = Ker(p)$ be the
kernel of $p$, which is a normal subgroup of $G$, with $H \cong G/N$. Thus we have
a short exact sequence:

\[ 1 \to N \to G \xrightarrow{p} H \to 1. \tag{5} \]

Let $S$ be a symmetric set of generators of $G$. Then
$S_H = \{ p(s) : s \in S - N \}$ is a symmetric set of generators of $H$.

Lemma 3. For $m = 1, 2$, $\lambda_m(H, S_H) \leq \lambda_m(G, S)$.

Proof: First suppose that $m = 1$. For $x \in S_H$ and $h \in H$, consider
$h^{-1} x h$. There exist elements $g \in G$ and $s \in S - N$ for which $p(g) = h$ and
$p(s) = x$. Now the element $g^{-1} s g \in G$ can be written as a product of at most
$l = \lambda_1(G,S)$ elements of $S$, that is $g^{-1} s g = s_1 s_2 \cdots s_k$ for
$k \leq l$. Applying $p$ to both sides of this equation gives:
$h^{-1} x h = p(s_1) p(s_2) \cdots p(s_k)$. Notice that some of the elements on the
right may equal the identity element of $H$ (since
$p(s_i) = 1_H \Leftrightarrow s_i \in N$), but they are elements of $S_H$ otherwise.
Thus $l_{S_H}(h^{-1} x h) \leq l$. Since this holds for all such elements $h, x$,
Eqn. (1) shows that $\lambda_1(H, S_H) \leq \lambda_1(G, S)$. The corresponding result
for $m = 2$ follows by an analogous argument. □
To obtain a lower bound for $\lambda_m(G,S)$ suppose that the short exact sequence
(5) is a split extension, i.e. there is a homomorphism $\iota : H \to G$ so that
$p \circ \iota$ is the identity map on $H$, which (by the splitting lemma) is
equivalent to the condition that $G$ is the semidirect product of $N$ with a subgroup
$H'$ isomorphic to $H$ (i.e. $G = NH' = H'N$, $H' \cap N = \{1_G\}$). In this case we
have the following bounds.

Proposition 4. Suppose a finite group $G$ is a semidirect product of subgroups $N$
(normal) and $H$. Let $S_N$, $S_H$ be symmetric generator sets for $N$ and $H$
respectively, and let $S = S_N \cup S_H$, which is a symmetric generator set for $G$.
Then:

\[ \lambda_1(H, S_H) \leq \lambda_1(G, S) \leq \lambda_1(H, S_H) + l_{S_N}(N). \]

In particular, by (3), $\lambda_2(G, S) \leq 2\lambda_1(H, S_H) + 2 l_{S_N}(N)$.

Proof: The lower bound on $\lambda_1(G,S)$ follows from Lemma 3. For the upper
bound we must show that for all $s \in S$ and $g \in G$,
$d_S(sg, g) \leq \lambda_1(H, S_H) + l_{S_N}(N)$ holds. We consider two cases:
(i) $s \in N$, and (ii) $s \in H$. In Case (i), note that the conjugate element
$g^{-1} s g$ is also an element of $N$; in this case we have the tighter bound
$d_S(sg, g) \leq l_{S_N}(N)$. In Case (ii), write $g = hn$ where $n \in N$ and
$h \in H$. Consider the word

\[ w = g^{-1} s g = n^{-1} h^{-1} s h n. \]

Since $N$ is normal we have $n^{-1}(h^{-1} s h) = (h^{-1} s h) n'$ for some element
$n' \in N$. Thus $w = h^{-1} s h n' n$. Write $w = w_1 w_2$ where
$w_1 = h^{-1} s h \in H$ and $w_2 = n' n \in N$. We can select $w_2$ to be a product of
terms of $S_N$ of length at most $l_{S_N}(N)$ and, by Equation (1), we can select $w_1$
to be a product of terms of $S_H$ of length at most $\lambda_1(H, S_H)$. Thus $w$ can
be written as a product of, at most, $\lambda_1(H, S_H) + l_{S_N}(N)$ elements of $S$. □
3. Permutation groups and genomic applications

We first describe a direct application that is relevant to the evolution of a
DNA sequence under a simple model of site substitution (Kimura's 3ST model) [10].
Consider the four-letter DNA alphabet $\mathcal{A} = \{A, C, G, T\}$ and the Klein
four-group $K = \mathbb{Z}_2 \times \mathbb{Z}_2$ with an action on $\mathcal{A}$ in
which the three non-zero elements of $K$ correspond to 'transitions'
($A \leftrightarrow G$, $C \leftrightarrow T$) and the two types of 'transversions'
($A \leftrightarrow C$, $G \leftrightarrow T$; and $A \leftrightarrow T$,
$G \leftrightarrow C$). This representation of the Kimura 3ST model was first
described and exploited by [6].

For $g \in K$ and $x \in \mathcal{A}$ let $g \circ x$ denote the element of
$\mathcal{A}$ obtained by the action of $g$ on $x$ (the identity element fixes each
element of $\mathcal{A}$). The resulting component-wise action of $K^n$ on
$\mathcal{A}^n$, defined by
$(g_1, \ldots, g_n) \circ (x_1, \ldots, x_n) = (g_1 \circ x_1, \ldots, g_n \circ x_n)$,
can be regarded as the set of all changes that can occur to a DNA sequence over
a period of time under site substitutions.

Now, under any continuous-time Markovian process these change events ('site
substitutions') occur just one at a time and so a natural generating set of $K^n$ is
the set $S_n$ of all elements of $K^n$ that consist of $1_K$ at all but one
co-ordinate. Moreover, since the action of $K^n$ on $\mathcal{A}^n$ is transitive and
free (and so is isomorphic to the action of $K^n$ on itself by right multiplication),
$\lambda_m(K^n, S_n)$ measures the impact of ignoring (for $m = 1$) or replacing (for
$m = 2$) one substitution in a chain of such events over time. As $K^n$ is Abelian,
one has $\lambda_1(K^n, S_n) = 1$ and $\lambda_2(K^n, S_n) = 2$, which implies that
this impact is minor, and, more significantly, is independent of $n$; this has
important statistical implications which we will describe further in Section 5.
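
As a concrete illustration of this action, the following Python sketch encodes each base as a pair of bits so that the Klein four-group acts by XOR. The particular encoding (`ENCODE`) and the function names are assumptions made for this example only; any encoding in which the three non-identity elements realise the transition and the two transversions would serve equally well.

```python
# Hypothetical encoding chosen for this sketch: bases as pairs of bits.
ENCODE = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "T": (1, 1)}
DECODE = {bits: base for base, bits in ENCODE.items()}

# The Klein four-group Z_2 x Z_2, with (0, 0) as the identity.
K = [(0, 0), (0, 1), (1, 0), (1, 1)]


def act(g, x):
    """Action of a group element g on a single base x."""
    a, b = ENCODE[x]
    return DECODE[(a ^ g[0], b ^ g[1])]


def act_on_sequence(gs, seq):
    """Component-wise action of (g_1, ..., g_n) on a DNA sequence."""
    return "".join(act(g, x) for g, x in zip(gs, seq))


def hamming(u, v):
    """Number of sites at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))


# (0, 1) is the transition; (1, 0) and (1, 1) are the two transversions.
print(act((0, 1), "A"), act((1, 0), "A"), act((1, 1), "A"))  # G C T
print(act_on_sequence([(0, 1), (0, 0), (1, 1)], "ACG"))  # GCC
```

Under this encoding the action on a single site is free and transitive, and the number of non-identity components of a group element equals the Hamming distance between a sequence and its image.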

For a related example, consider the ordered sequence of distinct genes
$(g_1, g_2, \ldots, g_n)$ partitioned into regions $R_1, R_2, \ldots, R_k$ so that
genomic rearrangements occur within each region, but not between regions (e.g. the
$R_i$ might refer to different chromosomes). This situation can be modelled by the
setting of Lemma 2, in which $G_i$ is a permutation group on the genes within $R_i$,
and $S_i$ is a set of elementary gene order rearrangement events that generates $G_i$
(we discuss some examples below). In this case, Lemma 2 provides a bound on
$\lambda_1$ and $\lambda_2$ that is independent of the number of regions $k$.

We turn now to the calculation of $\lambda_m(\Sigma_n, S)$ for the permutation group
$\Sigma_n$ on $n$ elements and various sets $S$ of generators. This group commonly
arises when studying genome rearrangements [11]. Our main interest is to determine,
for each instance of $S$, whether there is a constant $C$ (independent of $n$) for
which $\lambda_m(\Sigma_n, S) \leq C$, for $m = 1, 2$.

A permutation $g$ on the set $[n] := \{1, 2, \ldots, n\}$ is a bijective mapping from
$[n]$ to itself. We will also write $g$ as $g = [g_1, g_2, \ldots, g_n]$ where
$g_i = g(i)$ is the image of the map $g$ for $i \in [n]$. Note that, following the
usual convention, the product $gg'$ of two permutations $g, g' \in \Sigma_n$ will be
considered as the composition of the functions $g$ and $g'$. In particular,
$gg'(i) = g(g'(i))$ for all $i \in [n]$.

When studying genomes, each entry $g_i$ of a permutation $g$ corresponds to a gene
and the full list $[g_1, g_2, \ldots, g_n]$ to a genome. Multiplying $g$ by a
permutation leads to a rearrangement of the genome. For example, multiplying by a
transposition $t_{i,j}$ interchanges the values at positions $i$ and $j$ of $g$, i.e.
$[\ldots, g_i, \ldots, g_j, \ldots] t_{i,j} = [\ldots, g_j, \ldots, g_i, \ldots]$, and
multiplying by a reversal $r_{i,j}$ reverses the segment $[g_i, g_j]$, $i < j$, of
$g$, i.e.

\[ [\ldots, g_i, g_{i+1}, \ldots, g_{j-1}, g_j, \ldots] r_{i,j}
   = [\ldots, g_j, g_{j-1}, \ldots, g_{i+1}, g_i, \ldots]. \]

Such rearrangements are widely observed and studied in molecular biology [7].
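
The effect of these two operations can be sketched in a few lines of Python. The helper names `compose`, `transposition` and `reversal` are hypothetical; permutations are stored as tuples of 1-based values, and `compose(g, h)` follows the stated convention that the product $gh$ maps $i$ to $g(h(i))$. With this convention, multiplying on the right rearranges positions, whereas multiplying on the left relabels values.

```python
def compose(g, h):
    """Product gh with (gh)(i) = g(h(i)); entries are 1-based values."""
    return tuple(g[v - 1] for v in h)


def transposition(n, i, j):
    """The permutation t_{i,j} that swaps i and j and fixes everything else."""
    t = list(range(1, n + 1))
    t[i - 1], t[j - 1] = j, i
    return tuple(t)


def reversal(n, i, j):
    """The permutation r_{i,j} that maps k to i + j - k for i <= k <= j."""
    r = list(range(1, n + 1))
    r[i - 1:j] = range(j, i - 1, -1)
    return tuple(r)


g = (3, 1, 4, 2, 5)
print(compose(g, transposition(5, 2, 4)))  # (3, 2, 4, 1, 5): positions 2 and 4 swapped
print(compose(g, reversal(5, 2, 5)))       # (3, 5, 2, 4, 1): positions 2..5 reversed
print(compose(transposition(5, 2, 4), g))  # (3, 1, 2, 4, 5): values 2 and 4 swapped
```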

In genomics applications, we are often interested in defining some distance
between genomes. One distance that is commonly used in the context of permutations
is the breakpoint distance [17, 7.3]. For $g, g' \in \Sigma_n$, $d_{BP}(g, g')$ is
defined as the number of pairs of elements that are adjacent in the list
$[0, g_1, g_2, \ldots, g_n, n+1]$, but not in the list
$[0, g'_1, g'_2, \ldots, g'_n, n+1]$. For example, if $g = [1, 2, 3, 4, 5]$ and
$g' = [1, 4, 3, 2, 5] \in \Sigma_5$, we have $d_{BP}(g, g') = 2$. It is clear that
$\max\{ d_{BP}(g, g') : g, g' \in \Sigma_n \} = n + 1$.
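
A direct implementation of this definition is short. The sketch below treats adjacency as unordered (an assumption of this example, which agrees with the worked case above); the function names are hypothetical.

```python
def adjacencies(g):
    """Unordered adjacent pairs in the extended list [0, g_1, ..., g_n, n+1]."""
    ext = [0, *g, len(g) + 1]
    return {frozenset(pair) for pair in zip(ext, ext[1:])}


def breakpoint_distance(g, h):
    """Number of pairs adjacent in the extended list of g but not in that of h."""
    return len(adjacencies(g) - adjacencies(h))


print(breakpoint_distance([1, 2, 3, 4, 5], [1, 4, 3, 2, 5]))  # 2
print(breakpoint_distance([1, 2, 3, 4, 5], [3, 5, 2, 4, 1]))  # 6, the maximum n + 1 for n = 5
```

In the first call only the adjacencies {1, 2} and {4, 5} are lost; in the second no adjacency of the identity survives.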

Alternatively, one can consider the rearrangement distance between two genomes,
i.e. the minimal number of operations of a certain type (such as transpositions or
reversals) that can be applied to one of the genomes to obtain the other. In
terms of Cayley graphs, this distance can be conveniently expressed for
transpositions and reversals as follows. Let

\[ T = T_n := \{ t_{i,j} \in \Sigma_n : 1 \leq i < j \leq n \}, \]

\[ C = C_n := \{ t_{i,i+1} \in T : 1 \leq i \leq n - 1 \} \]

(the Coxeter generators), and

\[ R = R_n := \{ r_{i,j} \in \Sigma_n : 1 \leq i < j \leq n \}. \]

Note that all three of these sets generate $\Sigma_n$ [11] and that they are all
symmetric, since each generator is its own inverse. The metric $d_S$, $S = T, C, R$,
is precisely the rearrangement distance.

The diameters of $Cay(\Sigma_n, T)$ and $Cay(\Sigma_n, R)$ are both $n - 1$, and the
diameter of $Cay(\Sigma_n, C)$ is $\binom{n}{2}$.

Regarding the quantities $\lambda_m(\Sigma_n, S)$, we have the following result for
$S = T, C, R$:

Theorem 5. For $n \geq 7$ the following hold:
(i) $\lambda_1(\Sigma_n, T_n) = 1$ and $\lambda_2(\Sigma_n, T_n) = 2$.
(ii) $\lambda_1(\Sigma_n, C_n) = 2n - 3$ and
$2n - 2 \leq \lambda_2(\Sigma_n, C_n) \leq 4n - 6$.
(iii) $\frac{n+1}{2} \leq \lambda_m(\Sigma_n, R_n) \leq n - 1$, $m = 1, 2$.

Figure 2: (a) A diagrammatic representation of the element
$g = [3, 2, 5, 4, 7, 6, 1]$ in $\Sigma_7$, defined in the proof of Theorem 5 (iii).
(b) The product $r_{1,7} g = [5, 6, 3, 4, 1, 2, 7]$. Note that
$d_{BP}(r_{1,7} g, g) = 8$.

Proof: (i) Note that if $g \in \Sigma_n$ and $t_{i,j} \in T$, then:

\[ g^{-1} t_{i,j} g = t_{g^{-1}(i),\, g^{-1}(j)}. \tag{6} \]

Therefore $\lambda_1(\Sigma_n, T) = 1$ by (1). Thus, by Inequality (3), we have
$\lambda_2(\Sigma_n, T) \leq 2$. The equality $\lambda_2(\Sigma_n, T) = 2$ follows by
(2) and the fact that
$g^{-1} t_{i,j} t_{k,l} g = t_{g^{-1}(i), g^{-1}(j)} t_{g^{-1}(k), g^{-1}(l)}$, which
is neither the identity nor a transposition, for any $g \in \Sigma_n$ and
$1 \leq i < j < k < l \leq n$.

(ii) Consider the permutation $g \in \Sigma_n$ given by
$g = [2, 3, \ldots, n-1, n, 1]$. Then
$g^{-1} t_{1,2} g = [n, 2, 3, \ldots, n-1, 1]$. Therefore,
$l_C(g^{-1} t_{1,2} g) \geq 2n - 3$ (since to transform $[n, 2, 3, \ldots, n-1, 1]$ to
$1_{\Sigma_n}$ requires moving 1 and $n$ back to their original positions).
Therefore, $\lambda_1(\Sigma_n, C) \geq 2n - 3$ by (1). But, by Equality (6),
$\lambda_1(\Sigma_n, C) \leq 2n - 3$, since any transposition is the product of at
most $2n - 3$ elements in $C$. In particular, $\lambda_1(\Sigma_n, C) = 2n - 3$.

Similarly, $l_C(g^{-1} t_{1,2} t_{3,4} g) \geq 2n - 2$, and so
$\lambda_2(\Sigma_n, C) \geq 2n - 2$ by (2). Moreover, by Inequality (3), we have
$\lambda_2(\Sigma_n, C) \leq 2(2n - 3)$.

(iii) The inequality $\lambda_m(\Sigma_n, R_n) \leq n - 1$, $m = 1, 2$, follows as the
diameter of $Cay(\Sigma_n, R)$ is at most $n - 1$.

Now, suppose $n$ is odd. Let $g \in \Sigma_n$ be given by
$g = [3, 2, 5, 4, 7, 6, \ldots, n-3, n, n-1, 1]$. Then it is straightforward to check
that $d_{BP}(r_{1,n} g, g) = n + 1$ (see Figure 2 for the case $n = 7$). In
particular, since the length of any shortest path in $Cay(\Sigma_n, R)$ joining any
$g, h \in \Sigma_n$ is at least $d_{BP}(h, g)/2$ by [17, p.238], we have
$\lambda_1(\Sigma_n, R) \geq \frac{n+1}{2}$. Similarly,
$d_{BP}(r_{2,3} r_{1,n} g, g) = n + 1$, and so
$\lambda_2(\Sigma_n, R) \geq \frac{n+1}{2}$.

In case $n$ is even, consider
$g = [3, 2, 5, 4, 7, 6, \ldots, n-4, n-1, n-2, 1, n]$. Then
$d_{BP}(r_{2,n} g, g) = n + 1$ and $d_{BP}(r_{3,4} r_{2,n} g, g) = n + 1$. Similar
reasoning yields the desired result. □
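
The breakpoint computations in Part (iii) can be checked mechanically for small $n$. The sketch below is illustrative only: the helper names are hypothetical, adjacency is taken to be unordered, and `witness(n)` builds the permutation $g$ of the proof, reading the even case as the odd pattern on $n - 1$ elements followed by the entry $n$. Left multiplication by a reversal is computed with the convention $(rg)(i) = r(g(i))$.

```python
def adjacencies(g):
    """Unordered adjacent pairs in the extended list [0, g_1, ..., g_n, n+1]."""
    ext = [0, *g, len(g) + 1]
    return {frozenset(pair) for pair in zip(ext, ext[1:])}


def breakpoint_distance(g, h):
    return len(adjacencies(g) - adjacencies(h))


def compose(g, h):
    """Product gh with (gh)(i) = g(h(i)); entries are 1-based values."""
    return tuple(g[v - 1] for v in h)


def reversal(n, i, j):
    """The permutation r_{i,j} that maps k to i + j - k for i <= k <= j."""
    r = list(range(1, n + 1))
    r[i - 1:j] = range(j, i - 1, -1)
    return tuple(r)


def witness(n):
    """The permutation g used in the proof, for odd or even n."""
    m = n if n % 2 == 1 else n - 1
    g = [v for k in range(1, (m - 1) // 2 + 1) for v in (2 * k + 1, 2 * k)]
    g.append(1)
    if n % 2 == 0:
        g.append(n)
    return tuple(g)


def witness_distance(n):
    """d_BP(r g, g) with r = r_{1,n} for odd n and r = r_{2,n} for even n."""
    g = witness(n)
    r = reversal(n, 1, n) if n % 2 == 1 else reversal(n, 2, n)
    return breakpoint_distance(compose(r, g), g)


print(witness(7), compose(reversal(7, 1, 7), witness(7)))
print([witness_distance(n) for n in range(7, 13)])  # [8, 9, 10, 11, 12, 13]
```

In each case the distance equals $n + 1$, so a single reversal can destroy every adjacency of $g$.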
In genomics, the direction in which a gene is oriented in a genome can also
provide useful information to incorporate in rearrangement models, which can be
expressed as follows in terms of Cayley graphs [11]. The hyperoctahedral group $B_n$
is defined as the group of all permutations $g^{\sigma}$ acting on the set
$\{\pm 1, \ldots, \pm n\}$ such that $g^{\sigma}(-i) = -g^{\sigma}(i)$ for all
$i \in [n]$. An element of $B_n$ is a signed permutation. Signed versions of
transpositions and reversals can be defined in the obvious way; a sign change
transposition $t^{\sigma}_{i,j}$ switches the values in the $i$th and $j$th positions
of a signed permutation as well as both of their signs, and so forth. Note that we
also allow $i = j$ for signed transpositions and reversals so that
$t^{\sigma}_{i,i} = r^{\sigma}_{i,i}$, $i \in [n]$, simply switches the sign of the
$i$th value. We denote the set of signed elements corresponding to those in
$S = T, C, R$, together with the elements $t^{\sigma}_{i,i}$, $1 \leq i \leq n$, by
$S^{\sigma}$. Note that the diameter of $Cay(B_n, R^{\sigma})$ is $n + 1$ [11].

Now, regarding the group $B_n$ as a wreath product [11, p. 2756], we have a
short exact sequence:

\[ 1 \to N \to B_n \xrightarrow{p} \Sigma_n \to 1, \tag{7} \]

where the homomorphism $p : B_n \to \Sigma_n$ sends $g^{\sigma} \in B_n$ to the
permutation of $[n]$ that maps $i$ to $|g^{\sigma}(i)|$ (i.e. it ignores the sign).
Notice that $p$ maps $S^{\sigma}$ onto $S$ when $S = T, C, R$. In particular, from
Lemma 3, the following holds for $m = 1, 2$:

\[ \lambda_m(B_n, S^{\sigma}) \geq \lambda_m(\Sigma_n, S). \tag{8} \]

Moreover, $N = Ker(p)$ is isomorphic to the elementary Abelian 2-group
$\mathbb{Z}_2^n$ and the short exact sequence in (7) splits, so $B_n$ is a semidirect
product of $\mathbb{Z}_2^n$ and a subgroup isomorphic to $\Sigma_n$. Using these
observations, we obtain:

Corollary 6. For $n \geq 7$, the following hold:
(i) $\lambda_1(B_n, T^{\sigma}) \leq 3$ and $\lambda_2(B_n, T^{\sigma}) \leq 6$.
(ii) $2n - 3 \leq \lambda_1(B_n, C^{\sigma}) \leq 2n - 1$ and
$2n - 2 \leq \lambda_2(B_n, C^{\sigma}) \leq 4n - 2$.
(iii) $\frac{n+1}{2} \leq \lambda_m(B_n, R^{\sigma}) \leq n + 1$, $m = 1, 2$.

Proof: The inequalities $\lambda_1(B_n, T^{\sigma}) \leq 3$ and
$\lambda_1(B_n, C^{\sigma}) \leq 2n - 1$ follow from similar arguments to those used in
the proof of Theorem 5 (i) and (ii), using the signed analogue of Equation (6).
Inequality (3) then implies that the inequalities $\lambda_2(B_n, T^{\sigma}) \leq 6$
and $\lambda_2(B_n, C^{\sigma}) \leq 4n - 2$ both hold. The inequality
$\lambda_m(B_n, R^{\sigma}) \leq n + 1$, $m = 1, 2$, follows as the diameter of
$Cay(B_n, R^{\sigma})$ is at most $n + 1$. The inequalities
$2n - 3 \leq \lambda_1(B_n, C^{\sigma})$ and $2n - 2 \leq \lambda_2(B_n, C^{\sigma})$,
and the remaining ones in (iii), follow by Inequality (8) and Theorem 5. □
4. Beyond $d_S$: properties of breakpoint distance

As we have seen for the breakpoint distance on $\Sigma_n$ in the last section, it can
sometimes be useful to consider metrics on a group other than the distance $d_S$
arising from some Cayley graph. Motivated by this, given an arbitrary metric $d$
on a finite group $G$, with symmetric generator set $S$, we define:

\[ \lambda_1(G,S,d) := \max_{g \in G,\, s \in S} \{ d(sg, g) \} \quad \text{and} \quad
   \lambda_2(G,S,d) := \max_{g \in G,\, s, s' \in S} \{ d(sg, s'g) \}. \]

In particular, $\lambda_m(G,S) = \lambda_m(G,S,d_S)$ and
$\lambda_m(G,S,d) \leq \max_{g, g' \in G} \{ d(g, g') \}$, $m = 1, 2$. Moreover, the
following analogue of Inequality (3) for an arbitrary metric $d$ on $G$ is easily
seen to hold:

\[ \lambda_2(G,S,d) \leq 2 \cdot \lambda_1(G,S,d). \tag{9} \]

Note that, although the quantities $\lambda_m(G,S)$ and $\lambda_m(G,S,d)$ need not be
directly related to one another, in certain circumstances, they are. For example, if
$d$ has the property that $d(g, gs) \leq c$ for some constant $c$ and all $g \in G$,
$s \in S$, it is an easy exercise to show that
$\lambda_m(G,S,d) \leq c \cdot \lambda_m(G,S)$, for $m = 1, 2$.

We now return to considering the breakpoint distance $d_{BP}$. In genomics, this
distance is commonly used as a proxy for rearrangement distances. Thus it is of
interest to note:

Lemma 7. For $n \geq 7$, the following hold:

(i) $\lambda_1(\Sigma_n, T_n, d_{BP}) \leq 4$ and
$\lambda_2(\Sigma_n, T_n, d_{BP}) \leq 8$.

(ii) $\lambda_1(\Sigma_n, C_n, d_{BP}) \leq 4$ and
$\lambda_2(\Sigma_n, C_n, d_{BP}) \leq 8$.

(iii) $\frac{n+1}{2} \leq \lambda_m(\Sigma_n, R_n, d_{BP}) \leq n + 1$, $m = 1, 2$.

Proof: Suppose $t = t_{i,j} \in T_n$, $1 \leq i < j \leq n$. Using Equation (6), it is
straightforward to see that $d_{BP}(tg, g) \leq 4$ holds for any $g \in \Sigma_n$.
Therefore $\lambda_1(\Sigma_n, T_n, d_{BP}), \lambda_1(\Sigma_n, C_n, d_{BP}) \leq 4$.
The inequalities in (i) and (ii) involving $\lambda_2$ now follow from Inequality (9).

The inequalities in (iii) follow from the argument used in the proof of Theorem 5
(iii) and the diameter of $d_{BP}$ on $\Sigma_n$. □
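
The bound for transpositions can be confirmed by exhaustive search for small $n$. The following sketch (hypothetical function names; adjacency taken as unordered) computes the largest value of $d_{BP}(tg, g)$ over all permutations $g$ and all transpositions $t$, where left multiplication by $t_{i,j}$ swaps the values $i$ and $j$ in the list $g$.

```python
from itertools import combinations, permutations


def adjacencies(g):
    """Unordered adjacent pairs in the extended list [0, g_1, ..., g_n, n+1]."""
    ext = [0, *g, len(g) + 1]
    return {frozenset(pair) for pair in zip(ext, ext[1:])}


def lambda1_bp_transpositions(n):
    """Largest d_BP(tg, g) over all g in the symmetric group and all transpositions t."""
    worst = 0
    for g in permutations(range(1, n + 1)):
        adj_g = adjacencies(g)
        for i, j in combinations(range(1, n + 1), 2):
            swap = {i: j, j: i}
            tg = tuple(swap.get(v, v) for v in g)  # left multiplication by t_{i,j}
            worst = max(worst, len(adjacencies(tg) - adj_g))
    return worst


print(lambda1_bp_transpositions(7))  # 4
```

Swapping two values alters only the adjacencies that involve the two affected positions, of which there are at most four, which is why the maximum does not grow with $n$.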

In particular, for $C$, the set of Coxeter generators of $\Sigma_n$ in the last
section, and $m = 1, 2$, we have $\lambda_m(\Sigma_n, C) \geq 2n - 3$, but
$\lambda_m(\Sigma_n, C, d_{BP}) \leq 8$. Intriguingly, this observation can be
extended as follows. For $k \geq 1$, let $R^{(k)}$ denote the set of reversals of the
form $\{ r_{i,j} : 1 \leq i < j \leq n, |i - j| \leq k \}$. Such 'fixed-length'
reversals have been considered in the context of genome rearrangements in e.g. [2].
Note that $R^{(1)} = C$ and $R^{(k)} \subseteq R^{(k+1)}$, so that $R^{(k)}$ generates
$\Sigma_n$.

Proposition 8. For $n \geq 7$, $n > k \geq 1$ and $m = 1, 2$,

\[ \lambda_m(\Sigma_n, R^{(k)}) \geq 2 \left\lceil \frac{n}{k} \right\rceil - 3, \]

and

\[ \lambda_m(\Sigma_n, R^{(k)}, d_{BP}) \leq 4(k + 1). \]

Proof: As in the proof of Theorem 5 (ii), let $g \in \Sigma_n$ be given by
$g = [2, 3, \ldots, n-1, n, 1]$, so that
$g^{-1} r_{1,2} g = [n, 2, 3, \ldots, n-1, 1]$. Then,
$l_{R^{(k)}}(g^{-1} r_{1,2} g) \geq 2 \lceil n/k \rceil - 3$, since to transform
$[n, 2, 3, \ldots, n-1, 1]$ to $1_{\Sigma_n}$ requires moving 1 and $n$ back to their
original positions. Similarly,
$l_{R^{(k)}}(g^{-1} r_{1,2} r_{3,4} g) \geq 2 \lceil n/k \rceil - 2$. This gives the
first inequality in the proposition. Moreover, if $r_{i,j}, r_{p,q} \in R^{(k)}$, then
it is straightforward to see that $d_{BP}(r_{i,j} g, g) \leq 2(k + 1)$ and
$d_{BP}(r_{p,q} r_{i,j} g, g) \leq 4(k + 1)$ hold, which gives the second inequality
in the proposition. □
This proposition implies that in genomics applications, adding or substituting
a single reversal in a sequence of reversals in $R^{(k)}$ could potentially have a
large effect on $d_{R^{(k)}}$, but a relatively small effect on $d_{BP}$ (especially
for large values of $n$, e.g. there are $n > 20{,}000$ genes in the human genome). It
could be of interest to see whether other combinations of generating sets and
metrics for $\Sigma_n$ commonly used in genomics (such as transpositions [13] and the
$k$-mer distance) exhibit a similar type of behaviour.

5. Statistical implications 

So far we have considered metric sensitivity from a purely combinatorial and 
deterministic perspective. But it is also of interest to investigate the sensitivity 
of the metrics discussed above when the elements of S are randomly assigned. 
Again, the motivation for this question comes from genomics, where stochastic 
models often play a central role (see, for example, [H], [22]). In this section, 
we establish a result (Proposition 9) in which the quantity $\lambda_2$ plays a crucial
role in allowing underlying parameters in such stochastic models to be estimated 
accurately given sufficiently long genome sequences. Our motivation here is to 
provide some basis for eventually extending the well-developed (and tight) results 
on the sequence length requirements for tree reconstruction under site-substitution 
models (see e.g. [31 |5l El [H]) to more general models of genome evolution. 

Consider any model of genome evolution, where an associated transformation 
group G acts freely on a set X of genomes of length n, and for which events in some 
symmetric generating set S occur independently according to a Poisson process. 
Regard the elements of X as leaves of an evolutionary (phylogenetic) tree with 
weighted edges [18], and let $\mu(x, y)$ be the sum of the weights of the edges of the
tree connecting leaves $x, y$. Then we make the following assumption:

• The expected number of times that $s \in S$ occurs along the path in the tree
connecting $x$ and $y$ can be written as $n \cdot \mu_s(x, y)$ (i.e. we assume that
the rate of events scales linearly with the length of the genome).

Let $\mu(x, y) = \sum_{s \in S} \mu_s(x, y)$. Then the total number of events in $S$
that occur on the path separating $x$ and $y$ has a Poisson distribution with mean
$n \cdot \mu(x, y)$.

Now suppose $d$ is some metric on genomes that satisfies the following three
properties:

(i) $d(x, g \circ x)$ depends just on $g$, for each $x \in X$ and $g \in G$.

(ii) $\lambda_2(G, S, d)$ is independent of $n$.

(iii) $\bar{d} = n f(\mu(x, y))$, where $\bar{d}$ is the expected value in the model of
$d(x, y)$ and $f$ is a function with strictly positive but bounded first derivative
on $(0, \infty)$.

An example to illustrate this process is site substitutions, under the Kimura
3ST model, described at the start of Section 3, taking $d = d_S$, where we observed
that Properties (i) and (ii) hold (note that in this case, $d(x, y)$ is the 'Hamming
distance' between the sequences, which counts the number of sites at which $x$ and
$y$ differ). In that case, Property (iii) also holds, since

\[ \bar{d} = n \cdot \frac{3}{4} \left( 1 - \exp(-4\mu(x, y)/3) \right). \]

Note that both breakpoint distance and $d_S$ satisfy (i), and we have described
above some cases where (ii) is satisfied. Whether (iii) holds (or the assumption
that the expected number of events scales linearly with $n$) depends on the details
of the underlying stochastic process of genome rearrangement. For example, for
the approximation to the Nadeau-Taylor model of genome rearrangement studied
in Section 2 of [21], Property (iii) holds under the assumption that the number of
events separating $x$ and $y$ has a Poisson distribution whose mean scales linearly
with $n$ (the proof relies on Corollary 1(a) of [21]).

The following result shows how $d/n$ can be used to estimate $f(\mu(x, y))$
accurately, and thereby $\mu(x, y)$ (by the assumptions regarding $f$). The ability to
estimate $\mu(x, y)$ accurately provides a direct route to accurate tree
reconstruction by standard phylogenetic methods (such as 'neighbor-joining' [16])
since $\mu(x, y)$ is 'additive' on the underlying tree but not on alternative binary
trees (for details, see [18]).
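
To make the estimation step concrete, the sketch below inverts $f$ in the site-substitution example, assuming $f(\mu) = \frac{3}{4}(1 - \exp(-4\mu/3))$ as in that example. The function names `f_k3st` and `estimate_mu` are hypothetical, and the estimator is simply the algebraic inverse of $f$ applied to the observed proportion $d/n$; it is not a method taken from the text.

```python
import math


def f_k3st(mu):
    """Assumed form of f: expected proportion of differing sites at distance mu."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu / 3.0))


def estimate_mu(d, n):
    """Invert f at the observed proportion d / n of differing sites."""
    p = d / n
    if not 0.0 <= p < 0.75:
        raise ValueError("d / n must lie in [0, 0.75) for the inverse to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


n = 1000
mu = 0.2
expected_d = n * f_k3st(mu)        # expected Hamming distance at mu = 0.2
print(estimate_mu(expected_d, n))  # recovers mu up to rounding error
```

Because $f$ has a strictly positive, bounded derivative, a small error in $d/n$ translates into a controlled error in the estimate of $\mu(x, y)$, provided $d/n$ stays away from the saturation value 3/4.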

Proposition 9. Consider any stochastic model of genome evolution for which
events in $S$ occur according to a Poisson process with a rate that scales linearly
with $n$, and any metric $d$ that satisfies conditions (i)-(iii) above. Then the
probability that $d(x, y)/n$ differs from $f(\mu(x, y))$ by more than $z$ converges to
zero exponentially quickly with increasing $n$. More precisely, for constants
$b > 0$ and $c > 0$ that depend just on $\mu(x, y)$ and on the pair
$(\lambda_2(G, S, d), \mu(x, y))$, respectively, we have:

\[ \mathbb{P}(|d/n - f(\mu(x, y))| \geq z) \leq \exp(-bn) + 2\exp(-cz^2 n), \]

for $d = d(x, y)$.

Proof of Proposition 9: We first recall the Azuma-Hoeffding inequality (see e.g. [1])
in which $X_1, X_2, \ldots, X_k$ are independent random variables taking values in
some set $S$, and $h$ is any real-valued function defined on $S^k$ that satisfies the
following property for some constant $\xi$:

\[ |h(x_1, x_2, \ldots, x_k) - h(x'_1, x'_2, \ldots, x'_k)| \leq \xi, \]

whenever $(x_i)$ and $(x'_i)$ differ at just one coordinate. In this case, the random
variable $Y := h(X_1, X_2, \ldots, X_k)$ has the tight concentration bound for all
$k \geq 1$:

\[ \mathbb{P}(|Y - \mathbb{E}[Y]| \geq z) \leq 2\exp\left( -\frac{z^2}{2\xi^2 k} \right). \tag{10} \]

We apply this general result as follows. Let $K$ be the random total number of
events in $S$ that occur in the path separating $x$ and $y$. By assumption, $K$ has
a Poisson distribution with mean $n \cdot \mu(x, y)$. Conditional on the event
$K = k$, let $X_1, \ldots, X_k$ be the actual elements of $S$ that occur. It is assumed
that these events are independent. Moreover, by (i), $d(x, y)$ is a function of
$X_1, \ldots, X_k$, and by (ii) this function satisfies the requirements of the
Azuma-Hoeffding inequality for $\xi = \lambda_2(G, S, d)$. Thus (10) furnishes the
following inequality:

\[ \mathbb{P}(|d/n - \bar{d}/n| \geq z \mid K = k)
   \leq 2\exp\left( -\frac{z^2 n^2}{2\lambda_2^2 k} \right). \tag{11} \]

Invoking Property (iii) and the law of total probability, we obtain:

\[ \mathbb{P}(|d/n - f(\mu(x, y))| \geq z)
   = \sum_{k \geq 0} \mathbb{P}(|d/n - \bar{d}/n| \geq z \mid K = k)\, \mathbb{P}(K = k), \]

from which (11) ensures the inequality:

\[ \mathbb{P}(|d/n - f(\mu(x, y))| \geq z)
   \leq 2\,\mathbb{E}\left[ \exp\left( -\frac{z^2 n^2}{2\lambda_2^2 K} \right) \right], \tag{12} \]

where $\mathbb{E}$ denotes expectation with respect to $K$. Let us write
$\mathbb{E}[\exp(-\frac{z^2 n^2}{2\lambda_2^2 K})]$ as a weighted sum of two conditional
expectations:

\[ \mathbb{E}\left[ \exp\left( -\frac{z^2 n^2}{2\lambda_2^2 K} \right) \,\Big|\, K > 2n \cdot \mu(x, y) \right] \cdot p
   + \mathbb{E}\left[ \exp\left( -\frac{z^2 n^2}{2\lambda_2^2 K} \right) \,\Big|\, K \leq 2n \cdot \mu(x, y) \right] \cdot (1 - p), \tag{13} \]

where $p = \mathbb{P}(K > 2n \cdot \mu(x, y))$.

The first term in (13) is bounded above by $\mathbb{P}(K > 2n \cdot \mu(x, y))$ since
$\exp(-\frac{z^2 n^2}{2\lambda_2^2 K}) \leq 1$; moreover, since $K$ has a Poisson
distribution with mean $n \cdot \mu(x, y)$ (and so is asymptotically normally
distributed with mean and variance equal to $\mu n$), the quantity
$\mathbb{P}(K > 2n \cdot \mu(x, y))$ is bounded above by a term of the form
$\exp(-bn)$ where $b$ depends just on $\mu(x, y)$.

The second term in (13) is bounded above by
$\exp\left( -\frac{z^2 n}{4\lambda^2 \mu(x, y)} \right)$, where
$\lambda = \lambda_2(G, S, d)$, since the function $x \mapsto \exp(-A/x)$ increases
monotonically on $(0, \infty)$.

Combining these two bounds in (13), the result now follows from (12). □

Remark. Referring again to the particular case of site substitutions under the
Kimura 3ST model, Proposition 9 can be strengthened to:

\[ \mathbb{P}(|d/n - f(\mu(x, y))| \geq z) \leq 2\exp(-c' z^2 n), \]

where $c' > 0$ can be chosen to be independent of $\mu(x, y)$. This stronger result is
the basis of numerous results in the phylogenetic literature that show that large
trees can be reconstructed from remarkably short sequences under simple
site-substitution models [5]. Although the bound in Proposition 9 is less incisive, it
would be of interest to explore similar phylogenetic applications for other models
of genome evolution in which $\lambda_2$ is independent of $n$, such as those involving
breakpoint distance under reversals of fixed length.